Abstract. It is well known among phylogeneticists that adding an extra
taxon (e.g. species) to a data set can alter the structure of the optimal
phylogenetic tree in surprising ways. However, little is known about this "rogue
taxon" effect. In this paper we characterize the behavior of balanced mini- 
mum evolution (BME) phylogenetics on data sets of this type using tools from 
polyhedral geometry. First we show that for any distance matrix there exist 
distances to a "rogue taxon" such that the BME-optimal tree for the data set 
with the new taxon does not contain any nontrivial splits (bipartitions) of the 
optimal tree for the original data. Second, we prove a theorem which restricts 
the topology of BME-optimal trees for data sets of this type, thus showing 
that a rogue taxon cannot have an arbitrary effect on the optimal tree. Third, 
we construct polyhedral cones computationally which give complete answers 
for BME rogue taxon behavior when our original data fits a tree on four, five, 
and six taxa. We use these cones to derive sufficient conditions for rogue taxon 
behavior for four taxa, and to understand the frequency of the rogue taxon 
effect via simulation. 



1. Introduction 

Ideally, phylogenetic data sets would have the property that the optimal tree for 
a subset X of taxa Y would be the same as the tree obtained by restricting the 
optimal tree on Y to the set X. However, practicing phylogeneticists are well aware 
that this is not the case; the extensive literature on "taxon sampling" reviewed 
below is evidence to the contrary. One can also find references to "rogue taxa" 
which, although not clearly defined or rigorously investigated, are taxa that do
not fit into a tree and whose inclusion may disrupt the inference of evolutionary
relationships of the other taxa. For example, Sullivan and Swofford (1997) state
". . . the hedgehog therefore appears to represent a 'rogue' taxon that cannot be
placed reliably with these data and that possibly confounds attempts to estimate
the relationships among the remaining taxa." The "rogue" descriptor is also used
by Baurain et al. (2007) to describe taxa with a "strong nonphylogenetic signal";
these authors describe the importance of finding and eliminating these taxa from
phylogenetic studies.

Surprisingly, we were unable to find any mathematical or simulation-based analysis
of the action of rogue taxa in phylogenetic trees. The closest studied subject is
"taxon sampling." This area of research is focused on the following question: if we
are interested in the phylogenetic tree on a set of taxa Y, do we do better or worse
by adding more taxa into the tree? If better, is the improvement more significant
than would be gained by increasing the length of the sequences (by redirecting
resources)?

2010 Mathematics Subject Classification. 92B99 (92D15), 52B12.

Key words and phrases. minimum evolution, distance-based phylogenetic inference, linear
programming, polytope, normal fan.

The first author was supported by a UC Berkeley Chancellor's Fellowship. The second author
was supported by the Miller Institute for Basic Research at UC Berkeley.

MARIA ANGELICA CUETO AND FREDERICK A. MATSEN

The origins of the taxon sampling debate can be traced to the pioneering paper of
Felsenstein (1978) that demonstrated mathematically the existence of "long branch
attraction," where two pendant branches are artifactually placed close together
by parsimony algorithms. This led to the question of whether parsimony long branch
problems could be dispensed with by adding new taxa to the dataset to break
up the long branches; Hendy and Penny (1989) answered affirmatively under
certain conditions. The investigation was continued by Kim (1996), who showed
that the situation is subtle and that the new taxa must appear in specific regions
of the tree in order to counter the long branch attraction problem.

These mathematical investigations of parsimony were followed by a flood of
simulation-based papers investigating maximum likelihood, parsimony, and
distance methods for phylogenetics. Hillis (1996), Graybeal (1998), and Poe (1998)
indicated that a larger number of taxa improved estimation, whereas the high-profile
publication of Rosenberg and Kumar (2001) claimed the opposite. The Hillis group
responded (Zwickl and Hillis 2002; Pollock et al. 2002; Hillis et al. 2003), which
led to Rosenberg and Kumar (2003) somewhat moderating their position. The
debate on taxon sampling has continued to the present day, with additional
simulations (Poe 2003; DeBry 2005; Hedtke et al. 2006), review articles (Heath et al.
2008a), and studies to understand the impact of taxon sampling on the inference
of macroevolutionary processes (Heath et al. 2008b). The simulation literature
in this area is considered important enough to even have a paper (Rannala et al.
1998) about methodology for taxon-sampling simulations.



There are two inherent difficulties with simulations of this type. First, the col- 
lection of possible parameter values for simulation is vast, and any simulation study 
must make choices about which parameters to use. This first problem alone may be 
the source of the disagreement found in the taxon selection literature. Second, the 
simulations are done by simulating data with a single model on a tree, then recon- 
structing. This does not address the problem of what happens when considering 
unusual data sets, such as those obtained by major model misspecifications. 

A mathematical approach can address these difficulties, although with certain 
caveats. Theorems can indicate that a phenomenon will always happen given cer- 
tain criteria, and the construction of the complete spaces of examples or counter- 
examples gives very precise information about these questions. By exploring the 
complete space of data sets of a certain type, one is not limited to data sets which are 
within a certain class of models. The trade-off for the strength of these conclusions 
is that often the setting must be simplified to make the problem mathematically 
tractable. 

In order to address taxon selection and the rogue taxon effect problem mathemat- 
ically, we have chosen to use distance-based phylogenetics, specifically the Balanced 
Minimum Evolution (BME, described below) criterion. Because the optimality cri- 
terion is expressed in terms of the minimization of an inner product, we are able to 
harness the power of polyhedral geometry to answer the questions of interest with a 
high degree of precision. Although BME-based algorithms are not among the most 
popular in phylogenetics, implementations do exist which show good performance 
under simulation (Desper and Gascuel 2002b). The BME criterion is consistent
(Desper and Gascuel 2004), as is FastBME which minimizes BME through tree
rearrangements (Bordewich et al. 2009). Another motivation for studying BME is
the close relationship between BME and the very popular Neighbor-Joining (NJ)
algorithm (Saitou and Nei 1987). Specifically, NJ has been shown to be a heuristic
BME minimizer (Desper and Gascuel 2005); the relationship between the two
algorithms has been investigated by Eickmeyer et al. (2008).



After describing a bit of terminology, we will discuss the main results of the paper. 
Note that by dissimilarity map we simply mean a mapping D from unordered 
pairs of taxa to non-negative numbers such that D(x,x) = 0 for all x. These are
sometimes called "distance matrices" in the phylogenetics literature but we use 
dissimilarity map to emphasize that they need not satisfy the triangle inequality. 

Definition 1.1. Let t be a phylogenetic tree equipped with branch lengths b. The 
tree metric associated with t and b is the dissimilarity map obtained as follows: the 
distance between taxa i and j of t is given by the total length (i.e. sum of branch
lengths) of the path from i to j in t with respect to b. 

Next we define some core objects of study for this paper. 

Definition 1.2. Let D be a dissimilarity map on n taxa. A "lifting" $\tilde{D}$ of D is a
dissimilarity map on n + 1 taxa obtained from D by adding distances from the first
n taxa to an (n + 1)st taxon.

Definition 1.3. Let D be a dissimilarity map on n taxa, and let $\tilde{D}$ be a lifting of
D. The BME tree for D will be called the "lower tree," while the BME tree for $\tilde{D}$
will be called the "upper tree." The "restricted upper tree" will be the tree induced
on the original n taxa by restricting the upper tree to this set.

Our primary goal is to understand topological differences between the upper and
lower trees for various original dissimilarity maps D and various liftings $\tilde{D}$.

1.1. Overview of the paper. The first section describes the effect of adding a
new taxon when the original dissimilarity map D is arbitrary. Theorem 3.2 shows
that for any D there exists a lifting such that the intersection of the split sets for
the restricted upper tree and the lower tree consists of the trivial pendant splits.
In other words, we show that the restricted upper tree and the lower tree can be
maximally distant in terms of the Robinson-Foulds metric (Robinson and Foulds
1981). However, the upper tree cannot deviate from the lower in an arbitrary
way: Theorem 3.5 shows that certain combinations of lower and upper trees are
not possible. We also note that the trees of Theorem 3.2 need not be maximally
distant in terms of the quartet distance (Remark 3.4).

The second section addresses the case when the original dissimilarity map D is a 
tree metric for some tree t; in this setting there is no question of what the optimal 
tree for the lower taxa "should" be. That is, if the upper tree does not contain the 
lower tree, the additional taxon is definitely a disrupting "rogue" taxon. When D is 
a tree metric, there exists a simplified formulation of the BME computations. This 
"reduced" formulation has a linear rather than a quadratic number of variables, and 
allows polyhedral computation directly over the parameters of interest. We study 
the associated "reduced polytope" and several of its combinatorial and geometric 
properties, including its dimension. Using this "reduced" formulation we are able
to give sufficient conditions (Propositions 4.14 and 4.15) for the rogue taxon effect
when the lower tree has four taxa, as well as a perspective on the frequency of the
rogue effect through simulations for up to six lower taxa.

The computations in this paper were done with a combination of Gfan (Jensen
2009), Polymake (Gawrilow and Joswig 2000), and custom OCaml code using GSL,
the GNU Scientific Library. For the interested reader, source code is available at
http://github.com/matsen/roguebme.



2. Polyhedral geometry and BME phylogenetics 

In this section, we introduce the mathematical problem we wish to investigate 
and walk through the necessary background in polyhedral geometry. We start 
by defining the Balanced Minimum Evolution (BME) criterion for phylogenetic 
inference. 

For the purposes of this paper, all trees will be unrooted phylogenetic trees. We 
will use parenthetical "Newick format" to describe trees, such that ((a, b), (c, d), e)
indicates a five taxon tree with the pairs a, b and c, d being sister taxa (Felsenstein
2004). Sometimes we will write these unrooted trees in a rooted manner, as we feel
that ((a, b), (c, d)) is clearer than (a, b, (c, d)). The degree-two vertex of the rooted
representation should be suppressed. Trivalent trees are trees such that all internal 
nodes have degree three. 

Definition 2.1. Given a dissimilarity map D in $\mathbb{R}_{+}^{\binom{n}{2}}$, the "Balanced Minimum
Evolution" (BME) length of a phylogenetic tree t with respect to D is the quantity

(1) $\lambda(t, D) := \sum_{1 \le i < j \le n} \omega^{t}_{ij} D_{ij}$

where $\omega^{t}_{ij} = \prod_{v \in p^{t}_{ij}} (\deg(v) - 1)^{-1}$, and $p^{t}_{ij}$ denotes the internal vertices in t on the
path between leaves i and j.

Remark 2.2. In the case of a trivalent tree t, the weight $\omega^{t}_{ij}$ equals $2^{-|p^{t}_{ij}|}$.

A BME tree for an n x n non-negative matrix D will be a tree t minimizing
$\lambda(t, D)$ over all n-taxon trees. The BME algorithm is consistent on trivalent trees:
if D is a tree metric with trivalent tree topology t, then the BME tree of D is t
(Desper and Gascuel 2004).
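As an illustration of Definition 2.1 and of the consistency statement, the following Python sketch computes the weights and the BME length for trees written as nested tuples. The tuple encoding, the function names `bme_weights` and `bme_length`, and the branch lengths are illustrative assumptions; none of this is taken from the code used for the paper.

```python
from itertools import combinations

def bme_weights(tree):
    """Map each unordered leaf pair to its BME weight.

    `tree` is a nested tuple. The top-level tuple is one internal vertex,
    or a suppressed degree-two root when it has exactly two children.
    """
    weights = {}

    def visit(node, is_root):
        if not isinstance(node, tuple):
            return {node: 1.0}  # a leaf contributes no internal vertex
        child_maps = [visit(child, False) for child in node]
        degree = len(node) if is_root else len(node) + 1
        factor = 1.0 / (degree - 1)  # equals 1 for a degree-two root
        for left, right in combinations(child_maps, 2):
            for i, wi in left.items():
                for j, wj in right.items():
                    weights[frozenset((i, j))] = wi * wj * factor
        return {leaf: w * factor for m in child_maps for leaf, w in m.items()}

    visit(tree, True)
    return weights

def bme_length(tree, dist):
    """BME length: the weighted sum of pairwise dissimilarities."""
    return sum(w * dist[pair] for pair, w in bme_weights(tree).items())

# Tree metric for ((1, 2), (3, 4)) with hypothetical branch lengths.
pendant = {1: 1.0, 2: 2.0, 3: 1.5, 4: 0.5}
internal = 1.0
dist = {}
for i, j in combinations(pendant, 2):
    crosses = (i in (1, 2)) != (j in (1, 2))  # path uses the internal edge
    dist[frozenset((i, j))] = pendant[i] + pendant[j] + (internal if crosses else 0.0)

quartets = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
lengths = {q: bme_length(q, dist) for q in quartets}
best = min(lengths, key=lengths.get)
print(best, lengths)
```

In this example the quartet that generated the tree metric attains the smallest BME length, as the consistency result predicts for trivalent trees.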



Note that there is a volume-zero set of dissimilarity maps with multiple optimal
BME trees, and therefore it is not quite right to speak of "the" BME tree. All of
our statements remain true after replacing "the BME tree" with "a BME tree"; however,
we prefer stating the former. More precisely, given a dissimilarity map D, we have
two cases: either the set of possible BME trees of D consists of a single (trivalent)
tree, or the set has size at least two and it is closed under degenerations. That is,
if a trivalent tree t contracts to a BME tree for D, then t is also a BME tree for D;
this claim will be clear from the polyhedral perspective described below.

There are several equivalent formulations of the BME length (Eickmeyer et al.
2008), although we prefer (1) because of its polyhedral interpretation.



Global BME minimization is known to be hard (Guillemot and Pardi 2009).
The widely used Neighbor-Joining algorithm approaches the BME problem from a
greedy perspective (Studier and Keppler 1988). The FastME algorithm starts with
a heuristically obtained tree and then refines it using Nearest Neighbor Interchange
(NNI) to attempt to find the BME minimal tree (Desper and Gascuel 2004). A
better understanding of the BME polytope (defined below) could lead to better
such algorithms (Desper and Gascuel 2002a), analogous to how understanding the
traveling salesman polytope provides insight into the traveling salesman problem
(Padberg and Grötschel 1985).



We now introduce the BME polytope, first investigated by Eickmeyer et al.
(2008). A polytope in $\mathbb{R}^{m}$ is the convex hull of a finite number of points in $\mathbb{R}^{m}$. Fix
a positive integer n. The BME polytope in $\mathbb{R}^{\binom{n}{2}}$ is the convex hull of the points

$(\omega^{t}_{ij})_{1 \le i < j \le n}$

where t varies among all possible tree topologies on n taxa.

Using this polyhedral interpretation, the problem of finding the BME-optimal
tree t on n taxa corresponds to picking a vertex $\omega^{t}$ of the BME polytope minimizing
the Euclidean dot product of the vertex with a given dissimilarity map (considered
as a vector in $\mathbb{R}^{\binom{n}{2}}$). The BME tree is the tree associated to this vertex.

We can characterize this optimization process by constructing the corresponding
inner normal fan. The inner normal fan of a polytope $\mathcal{P} \subset \mathbb{R}^{N}$ is given as a finite
collection of cones (i.e. sets closed under multiplication by positive scalars) as
follows. Each cone in the inner normal fan of $\mathcal{P}$ corresponds to a face $\mathcal{F}$ of the
polytope $\mathcal{P}$ and is defined as

(2) $\mathcal{N}_{\mathcal{F}} := \{ w \in \mathbb{R}^{N} : \langle w, v \rangle = \min\{ \langle w, u \rangle : u \in \mathcal{P} \}, \ \forall v \in \mathcal{F} \}$,

i.e. those vectors such that the minimum inner product is achieved at all points of
the face $\mathcal{F}$.

By construction, each cone is polyhedral: it is the solution set of a system of
linear inequalities. As such, it can be expressed as the positive span (i.e. using
non-negative scalars) of finitely many vectors, which we call extremal rays. In addition,
the inner normal fan of $\mathcal{P}$ is a polyhedral fan because the family $\{\mathcal{N}_{\mathcal{F}} : \mathcal{F} \subset \mathcal{P} \text{ face}\}$
is closed under intersections. Moreover, this fan is complete (i.e. the union of all
cones equals the ambient space $\mathbb{R}^{N}$) and each cone $\mathcal{N}_{\mathcal{F}}$ has dimension equal to
$\operatorname{codim} \mathcal{F} = N - \dim \mathcal{F}$, where $\dim \mathcal{F}$ denotes the dimension of the affine span of
face $\mathcal{F}$. In particular, if $\mathcal{F}$ is a vertex, then $\mathcal{N}_{\mathcal{F}}$ is full dimensional. We call these
full-dimensional cones chambers. The inner normal fan of the BME polytope will
be referred to as the BME fan. We refer the reader to (Ewald 1996, Chapter 1)
for a complete exposition of normal fans.

Remark 2.3. From the previous discussion we see that the BME criterion is equiv- 
alent to the membership of a dissimilarity map D to a chamber in the BME fan. 
Thus D belongs to the interior of a chamber in the BME fan if and only if the BME 
tree of D is unique. The boundary of these chambers is the volume zero set having 
multiple BME trees (discussed earlier in this section). 

Since the BME polytope encodes the problem of finding the BME tree of a
dissimilarity map, it is worth understanding its structure. Some of its combinatorial
properties have been studied for small numbers of taxa, although several questions
remain open for n > 6. We investigate some of its features below, as described by
Eickmeyer et al. (2008).

The vertices of the BME polytope correspond to the points $(\omega^{t}_{ij})_{ij}$ where t is a
trivalent tree, for a total of (2n - 5)!! vertices (Pachter and Sturmfels 2005, Lemma
2.33). Here, (2n - 5)!! = (2n - 5) · (2n - 7) · · · 3 · 1. In addition, the vector $\omega^{s}$
associated to the star tree s (the tree with a single internal node) lies in the interior
of the polytope, whereas all other points $\omega^{t}$ lie on its boundary (Eickmeyer et al.
2008, Lemma 2.1).



The dimension of the BME polytope (i.e. the dimension of the affine space
spanned by this polytope) is $\binom{n}{2} - n$. The polytope is not full-dimensional because,
after translation to the origin, the orthogonal complement of its affine span is
spanned by the n shift vectors $\{h_a : a \in \{1, \ldots, n\}\}$. Here, the shift vector $h_a$ refers
to a dissimilarity map in which leaf a is at distance 1 from all other leaves, while
all other pairwise distances are 0.

The f-vector $f(\mathcal{P}) \in \mathbb{R}^{N}$ of an N-dimensional polytope $\mathcal{P}$ gives the number
of faces of each dimension of $\mathcal{P}$. That is, $f(\mathcal{P})_i = \#\{\text{faces of dimension } i - 1 \text{ of } \mathcal{P}\}$.
The f-vectors of BME polytopes have been studied for up to seven taxa. In
particular, for four and five taxa, these vectors have been completely described in
(Eickmeyer et al. 2008, Table 1), whereas for six and seven taxa some of the entries
of the f-vector have remained unknown up to now. We were able to compute the
complete f-vector for six taxa by methods of tropical geometry, using Gfan. The
resulting f-vector is:

(105, 5460, 105945, 635265, 1715455, 2373345, 1742445, 640140, 90262). 

In particular, we see that the polytope has 90262 facets. It also has 105 vertices, 
labeled by all trivalent trees on six taxa. 

As a corollary of these computations, it follows that the edge graph of the BME
polytope for six taxa is the complete graph $K_{105}$ (Eickmeyer et al. 2008). This says
that any two vertices of the BME polytope can be connected by an edge. Similar
behavior occurs for four and five taxa, but this is no longer true for seven or more
taxa (Eickmeyer et al. 2008).
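The reported six-taxon f-vector can be checked against a few facts stated in this section. The short Python sketch below is only a consistency check under the conventions used here (entry i counts faces of dimension i - 1); it does not reproduce the Gfan computation, and the helper `double_factorial` is an illustrative name.

```python
from math import comb

f_vector = [105, 5460, 105945, 635265, 1715455, 2373345, 1742445, 640140, 90262]
n = 6

def double_factorial(k):
    """Product k * (k - 2) * ... down to 1 for odd k."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

dimension = comb(n, 2) - n                  # dimension of the BME polytope
num_trees = double_factorial(2 * n - 5)     # number of trivalent trees on n taxa
euler_sum = sum((-1) ** i * f for i, f in enumerate(f_vector))

print(dimension, len(f_vector))             # one entry per face dimension
print(num_trees, f_vector[0])               # vertices correspond to trivalent trees
print(comb(f_vector[0], 2), f_vector[1])    # complete edge graph: all pairs are edges
print(euler_sum, 1 - (-1) ** dimension)     # Euler's relation for a polytope
```

Each printed pair should agree: the vector has one entry per face dimension, the vertex count matches (2n - 5)!!, the edge count matches the number of vertex pairs, and the alternating sum satisfies Euler's relation.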

By construction, the BME polytope comes equipped with a natural symmetry
given by the symmetric group $S_n$ on n elements. Namely, relabeling the leaves of
a trivalent tree t by a permutation $\sigma \in S_n$ sends t to the relabeled trivalent tree
$\sigma t$, and hence the vertex $\omega^{t}$ to $\omega^{\sigma t}$. In a similar way, higher dimensional faces
of the BME polytope will have this symmetry. Therefore, we can encode these
symmetries in the f-vector, and record the number of faces of each dimension, up
to the combinatorial action of $S_n$ on all faces. In the case of six taxa, we get:

(2, 20, 182, 982, 2492, 3489, 2626, 1032, 169). 

We illustrate these constructions and their properties in the case of four taxa. 

Example 2.4. (Eickmeyer et al. 2008) Fix n = 4. The points $\omega^{t}$ are:

$\omega^{((1,2),(3,4))} = \tfrac{1}{4}[2,1,1,1,1,2]$ ; $\omega^{((1,3),(2,4))} = \tfrac{1}{4}[1,2,1,1,2,1]$ ;

$\omega^{((1,4),(2,3))} = \tfrac{1}{4}[1,1,2,2,1,1]$ ; $\omega^{(1,2,3,4)} = \tfrac{1}{3}[1,1,1,1,1,1]$.

The BME polytope is a triangle in $\mathbb{R}^{6}$ with vertices $\omega^{((1,2),(3,4))}$, $\omega^{((1,3),(2,4))}$ and
$\omega^{((1,4),(2,3))}$. It spans the 2-dimensional space

$\{(x_{12}, x_{13}, x_{14}, x_{23}, x_{24}, x_{34}) \in \mathbb{R}^{6} : x_{12} + x_{13} + x_{14} = x_{12} + x_{23} + x_{24} = x_{13} + x_{23} + x_{34} = x_{14} + x_{24} + x_{34} = 1\}$.



The lineality space of a fan is defined as the maximal linear space contained in
all cones of the fan. If this space is just the origin, we say that the fan is pointed.
In the case of the BME fan, this linear subspace is n-dimensional with basis given
by the n shift vectors $h_a$ corresponding to the n leaves. Since the lineality space
lies in all cones of the fan, we can mod out by this subspace (for example, by taking
a projection to its orthogonal complement) and reduce our study to the case of
pointed complete polyhedral fans in $\mathbb{R}^{\binom{n}{2} - n}$. We illustrate the construction of the
BME fan and the associated pointed fan on four taxa.

Example 2.5. Let n = 4. We mod out by the lineality space $L = \langle h_1, h_2, h_3, h_4 \rangle$
via the canonical projection map $p : \mathbb{R}^{6} \to L^{\perp} \simeq \mathbb{R}^{2}$ to the orthogonal complement
of the subspace L given by the matrix

( 1 -1 -1 1 \ 
\ 1 -1 -1 1 J • 

We apply this projection to the BME fan, and we get a fan in $\mathbb{R}^{2}$, which we can
plot. Alternatively, we project the BME polytope into 2-space and we take the inner
normal fan of the resulting polytope.

From Example 2.4 we know that the BME polytope is the triangle with vertices
corresponding to the three quartet trees ((1, 2), (3, 4)), ((1, 3), (2, 4)) and ((1, 4), (2, 3)).
The projection p maps this triangle to the triangle with vertices (-2, 4), (4, -2) and
(-2, -2). Its inner normal fan consists of the rays spanned by $r_1 = (1, 0)$, $r_2 = (-1, -1)$
and $r_3 = (0, 1)$, plus the origin. Figure 1 shows the quartets corresponding
to the relative interior of each chamber.

Figure 1. Quartets minimizing the BME criterion for each dissimilarity map on
four taxa.



3. Behavior of BME under the addition of an extra taxon 

The purpose of this section is to investigate the relationship between lower and 
upper trees for arbitrary D. Section 3.1 shows that for any D there exists a lifting 
such that the upper tree is as different as possible from the lower tree in terms of 
splits. Section 3.2 provides a counterpoint by demonstrating that certain combinations
of lower and upper trees are not possible, i.e. that a rogue taxon cannot affect
a BME tree in arbitrary ways.

Notation 3.1. Throughout the remainder of the paper, we label our taxa by [n] =
{1, . . . , n}. We write $\mathbb{R}_{+}$ for the set of non-negative reals.

3.1. A theorem demonstrating the existence of unusual upper trees. We
show that every lower tree has an upper tree whose restriction to the lower taxa
is maximally different from it in terms of the Robinson-Foulds metric $\delta_{RF}$ on tree
topologies, although perhaps not in terms of quartet distance. The $\delta_{RF}$ metric on
phylogenetic trees is defined in terms of bipartitions in the tree, also called "splits."
A split in a phylogenetic tree is simply the bipartition of the taxa induced by cutting
an edge of the tree. For example, the split {1, 2}, {3, 4} is induced by cutting the internal
edge of the quartet ((1, 2), (3, 4)). Let $\Sigma(t)$ denote the set of splits of tree t; the
distance $\delta_{RF}(s, t)$ is simply one half the size of the symmetric difference of $\Sigma(t)$ and
$\Sigma(s)$ (Robinson and Foulds 1981).
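The split-based definition can be made concrete with a small Python sketch. Trees are encoded as nested tuples, and `splits`, `robinson_foulds` and `restrict` are hypothetical helper names chosen for this illustration. The helper `restrict` prunes a tree to a subset of taxa, which is one way to compare a tree carrying an extra taxon `"y"` with a tree on the original taxa.

```python
def splits(tree):
    """Set of splits of a nested-tuple tree (trivial splits included).

    Each split is a frozenset holding the two sides of the bipartition.
    """
    sides = []

    def visit(node):
        if isinstance(node, tuple):
            below = frozenset().union(*(visit(child) for child in node))
        else:
            below = frozenset([node])
        sides.append(below)
        return below

    taxa = visit(tree)
    # The top-level tuple covers all taxa and does not correspond to an edge.
    return {frozenset((below, taxa - below)) for below in sides if below != taxa}

def robinson_foulds(s, t):
    """One half the size of the symmetric difference of the split sets."""
    return len(splits(s) ^ splits(t)) / 2

def restrict(node, keep):
    """Induced tree on the taxa in `keep`, suppressing degree-two vertices."""
    if not isinstance(node, tuple):
        return node if node in keep else None
    kids = [r for r in (restrict(child, keep) for child in node) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

lower = ((1, 3), (2, 4))
upper = (("y", 1), 2, (3, 4))
restricted = restrict(upper, {1, 2, 3, 4})
print(restricted, robinson_foulds(restricted, lower))
```

For quartets the only nontrivial split is the internal edge, so two distinct quartets are at distance 1 and share only the four trivial pendant splits.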



The quartet distance is analogous to the Robinson-Foulds distance but with the
role of splits replaced by that of quartets (induced subtrees of size four) contained
in a tree. The naive algorithm for computation is $O(n^4)$, although it can be computed
in $O(n^2)$ via a simple algorithm (Bryant et al. 2000) and in $O(n \log n)$ via a
more complex algorithm (Brodal et al. 2004). In this paper, s will have one more
taxon than t; we accommodate this difference for the Robinson-Foulds and quartet
distances by simply taking the induced tree on s given by the set of lower taxa.

Theorem 3.2. Let D be a dissimilarity map on n taxa with BME tree t. There
exists a lifting $\tilde{D}$ whose upper tree s maximizes $\delta_{RF}(s, t)$ among all trees on n taxa.

This theorem will follow easily from the following lemma.

Lemma 3.3. Given an ordering of n taxa $z_1, \ldots, z_n$ and any distance matrix D
on taxa $\{z_i : 1 \le i \le n\}$, there exists a lifting $\tilde{D}$ such that the BME tree for $\tilde{D}$
restricted to $z_1, \ldots, z_n$ is the caterpillar tree $(z_1, (z_2, \ldots, (z_{n-1}, z_n) \ldots))$.

Proof. Pick arbitrary numbers $1 < a_1 < \cdots < a_n$. Let y denote the extra "rogue"
taxon. We construct a family of liftings $\tilde{D}_c$ as an exponential function for a given
base number c. Set $\tilde{D}_c(y, z_i) = c^{a_i}$.
We write the BME length of a tree s on the n + 1 taxa as

$\lambda(s, \tilde{D}_c) = \sum_{1 \le i < j \le n} \omega^{s}_{z_i z_j} D_{z_i z_j} + \sum_{1 \le i \le n} \omega^{s}_{y z_i} c^{a_i}$.

As c goes to infinity, the dominant term in the summation becomes $\omega^{s}_{y z_n} c^{a_n}$. For
c greater than some $c_n$, the BME tree must be a caterpillar tree with y as far as
possible from $z_n$. Indeed, any other topology would have a larger coefficient for
$c^{a_n}$. We can repeat the same argument replacing n by n - 1, finding a $c_{n-1}$ such
that for $c > c_{n-1}$ the BME tree must be a caterpillar tree with y as far as possible
from the subtree $(z_{n-1}, z_n)$. Continue in this way until a large enough lower bound
on c is found such that the described caterpillar tree is the BME tree for $\tilde{D}_c$. □
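The construction in the proof can be tried numerically for n = 4. The Python sketch below is a brute-force check under stated assumptions: five-taxon trivalent trees are written as ((a, b), m, (c, d)), so the weights are 1/2 within a cherry, 1/4 between the middle leaf m and any other leaf, and 1/8 across cherries (Remark 2.2). The base c = 10, the exponents 2, 3, 4, 5, and the unit-branch lower tree metric are arbitrary illustrative choices, and the function names are hypothetical.

```python
def quartet_length(tree, dist):
    """BME length of the quartet ((a, b), (c, d))."""
    (a, b), (c, d) = tree
    cherries = dist[frozenset((a, b))] + dist[frozenset((c, d))]
    crossing = sum(dist[frozenset((x, z))] for x in (a, b) for z in (c, d))
    return 0.5 * cherries + 0.25 * crossing

def five_taxon_length(tree, dist):
    """BME length of the five-taxon tree ((a, b), m, (c, d))."""
    (a, b), m, (c, d) = tree
    cherries = dist[frozenset((a, b))] + dist[frozenset((c, d))]
    middle = sum(dist[frozenset((m, x))] for x in (a, b, c, d))
    crossing = sum(dist[frozenset((x, z))] for x in (a, b) for z in (c, d))
    return 0.5 * cherries + 0.25 * middle + 0.125 * crossing

def five_taxon_trees(taxa):
    """All 15 trivalent trees on five taxa: a middle leaf and two cherries."""
    trees = []
    for m in taxa:
        a, b, c, d = [x for x in taxa if x != m]
        for left, right in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
            trees.append((left, m, right))
    return trees

# Lower data: unit-branch tree metric for the quartet ((1, 3), (2, 4)).
dist = {frozenset(p): 3.0 for p in [(1, 2), (1, 4), (2, 3), (3, 4)]}
dist[frozenset((1, 3))] = 2.0
dist[frozenset((2, 4))] = 2.0
quartets = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
lower = min(quartets, key=lambda q: quartet_length(q, dist))

# Exponential lifting with taxon order z_1, ..., z_4 = 1, 2, 3, 4.
c, exponents = 10, {1: 2, 2: 3, 3: 4, 4: 5}
lifted = dict(dist)
for z, a_z in exponents.items():
    lifted[frozenset(("y", z))] = c ** a_z
upper = min(five_taxon_trees(["y", 1, 2, 3, 4]),
            key=lambda s: five_taxon_length(s, lifted))
print(lower, upper)
```

With these values the lower tree is ((1, 3), (2, 4)), while the upper tree places y next to taxon 1 and restricts to the caterpillar (1, (2, (3, 4))), which shares no nontrivial split with the lower tree.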



With this lemma, all that is needed to prove Theorem 3.2 is to show that there
exists a caterpillar tree s such that the restriction of the caterpillar to the original
taxa has maximal $\delta_{RF}(s, t)$.

Proof of Theorem 3.2. Color the taxa of t with black and white colors as follows:
for every cherry (two taxon subtree) of t, color one taxon white and the other black,
and color the remaining taxa arbitrarily. Now order the taxa with all of the black
taxa first and all of the white taxa second. The caterpillar tree from Lemma 3.3
using this ordering will have the required maximal $\delta_{RF}$. □



Remark 3.4. The extension of Theorem 3.2 to quartet distances does not hold
for more than seven taxa. Indeed, let t be (1, ((((((2, 3), 4), 5), 6), 7), 8)). The maximally
quartet-distant trees on 8 taxa (of quartet distance 61) are the following
non-caterpillars:

(1,(2,((((3,8),5),(4,7)),6)))
(1,(2,((((3,8),6),(4,7)),5)))
(1,(((((2,8),5),(4,7)),6),3))
(1,(((((2,8),6),(4,7)),5),3)).

These trees were found by our code and distances were confirmed with the qdist
program of Mailund and Pedersen (2004).

One could perform a similar analysis for the path distance metric of Steel and
Penny (1993), although we have not done so.



3.2. A theorem restricting topology of upper trees. The previous section 
shows that the lower and upper trees can be quite different. It is natural then to 
ask about the collection of possible upper trees for a given lower tree. That is, if 
we have a dissimilarity map D on n taxa with BME tree t, what are the possible 
BME trees s for liftings of D? This question narrows the potential effect of rogue 
taxa. 

We first gain intuition by investigating the case of four taxa. This setting is 
simple, as there is only one trivalent tree topology on five taxa (up to relabeling of 
its leaves). 

Using Polymake one can show that all but two tree topologies can be realized as 
upper trees for a lower quartet. The two trees not above ((1, 2), (3, 4)) are shown
in Figure 2.



FIGURE 2. The trees that do not sit above ((1,2), (3,4)) for any 
lifting of a dissimilarity map D with BME tree ((1, 2), (3, 4)). 



This example can be established analytically and generalized to the case of more 
taxa by replacing the leaves 1 through 4 with rooted subtrees a through d. In 
particular, we show that we can never obtain a tree where pairs of subtrees are 
exchanged "over" the extra taxon. 

Let y denote the new leaf to be attached. The original tree t is the tree
((a, b), (c, d)). Call s the tree ((a, c), (b, d)) as in Figure 3.

Theorem 3.5. Let D be a dissimilarity map such that the BME score of t = ((a, b), (c, d))
is strictly less than that of s = ((a, c), (b, d)) (Figure 3). Then the BME score of
$t_y$ := ((a, b), y, (c, d)) is strictly less than that of $s_y$ := ((a, c), y, (b, d)) for any lifting
$\tilde{D}$ of D. Consequently, if t is the BME tree for D, then $s_y$ cannot be a BME tree
for any lifting $\tilde{D}$.

Figure 3. The trees t, s, $t_y$ and $s_y$.

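Proof. We denote with sans serif font the elements in each subtree, so $\mathsf{a}$ denotes a
leaf in subtree a, etc. For simplicity we abbreviate $\omega^{t}$ by $\omega$. By definition, we get

$\omega^{s_y}_{\mathsf{ab}} = \omega_{\mathsf{ab}}/4$ ; $\omega^{s_y}_{\mathsf{ac}} = 2\omega_{\mathsf{ac}}$ ; $\omega^{s_y}_{\mathsf{ad}} = \omega_{\mathsf{ad}}/2$ ; $\omega^{s_y}_{\mathsf{bc}} = \omega_{\mathsf{bc}}/2$ ; $\omega^{s_y}_{\mathsf{bd}} = 2\omega_{\mathsf{bd}}$ ; $\omega^{s_y}_{\mathsf{cd}} = \omega_{\mathsf{cd}}/4$ ;

$\omega^{t_y}_{\mathsf{ab}} = \omega_{\mathsf{ab}}$ ; $\omega^{t_y}_{\mathsf{ac}} = \omega_{\mathsf{ac}}/2$ ; $\omega^{t_y}_{\mathsf{ad}} = \omega_{\mathsf{ad}}/2$ ; $\omega^{t_y}_{\mathsf{bc}} = \omega_{\mathsf{bc}}/2$ ; $\omega^{t_y}_{\mathsf{bd}} = \omega_{\mathsf{bd}}/2$ ; $\omega^{t_y}_{\mathsf{cd}} = \omega_{\mathsf{cd}}$.

Similarly,

$\omega^{s}_{\mathsf{ab}} = \omega_{\mathsf{ab}}/2$ ; $\omega^{s}_{\mathsf{ac}} = 2\omega_{\mathsf{ac}}$ ; $\omega^{s}_{\mathsf{ad}} = \omega_{\mathsf{ad}}$ ; $\omega^{s}_{\mathsf{bc}} = \omega_{\mathsf{bc}}$ ; $\omega^{s}_{\mathsf{bd}} = 2\omega_{\mathsf{bd}}$ ; $\omega^{s}_{\mathsf{cd}} = \omega_{\mathsf{cd}}/2$.

Since we are interested in the difference between the two scores, we do not compute
the weights w.r.t. leaf y nor weights within a cluster, since both trees have the same
weight in these two cases. Then for any given lifting $\tilde{D}$ we have by subtraction

$\lambda(s_y, \tilde{D}) - \lambda(t_y, \tilde{D}) = \tfrac{3}{2} (\lambda(s, D) - \lambda(t, D))$.

The term on the right-hand side is positive by hypothesis. □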
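The identity at the end of the proof can be checked numerically in the simplest case, where each of the subtrees a, b, c, d is a single leaf. The Python sketch below draws random liftings and compares both sides. The hard-coded weights 1/2, 1/4 and 1/8 assume quartets ((a, b), (c, d)) and five-taxon trees ((a, b), y, (c, d)); the function names and the sampling range are illustrative assumptions.

```python
import random

def quartet_length(tree, dist):
    """BME length of the quartet ((a, b), (c, d))."""
    (a, b), (c, d) = tree
    cherries = dist[frozenset((a, b))] + dist[frozenset((c, d))]
    crossing = sum(dist[frozenset((x, z))] for x in (a, b) for z in (c, d))
    return 0.5 * cherries + 0.25 * crossing

def five_taxon_length(tree, dist):
    """BME length of the five-taxon tree ((a, b), m, (c, d))."""
    (a, b), m, (c, d) = tree
    cherries = dist[frozenset((a, b))] + dist[frozenset((c, d))]
    middle = sum(dist[frozenset((m, x))] for x in (a, b, c, d))
    crossing = sum(dist[frozenset((x, z))] for x in (a, b) for z in (c, d))
    return 0.5 * cherries + 0.25 * middle + 0.125 * crossing

t, s = (("a", "b"), ("c", "d")), (("a", "c"), ("b", "d"))
t_y, s_y = (("a", "b"), "y", ("c", "d")), (("a", "c"), "y", ("b", "d"))

rng = random.Random(0)
taxa = ["a", "b", "c", "d", "y"]
max_gap = 0.0
for _ in range(1000):
    # A random lifting; its entries on a, b, c, d play the role of D.
    lifted = {frozenset((u, v)): rng.uniform(0.0, 10.0)
              for i, u in enumerate(taxa) for v in taxa[i + 1:]}
    lower_diff = quartet_length(s, lifted) - quartet_length(t, lifted)
    upper_diff = five_taxon_length(s_y, lifted) - five_taxon_length(t_y, lifted)
    max_gap = max(max_gap, abs(upper_diff - 1.5 * lower_diff))
print(max_gap)
```

The printed gap should be zero up to floating-point rounding, since the distances to y cancel in the upper difference.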



4. Liftings of tree metrics 

In the previous section, we analyzed the relationship between the lower and upper 
trees for liftings of a general dissimilarity map D. For a practicing phylogeneticist, 
however, this provides limited useful information. Indeed, the basic assumption 
of phylogenetic inference is that the data evolves in a primarily tree-like manner. 
Namely, in distance-based inference, the assumption is that the given dissimilarity 
map is "close" to a tree metric. In the rogue setting, we are interested in n taxa 
which evolve in a tree-like manner and one, the rogue, that does not. 

In this section we formalize these notions by assuming that D is a tree metric 
with respect to the tree topology t. By the consistency of BME inference, the lower 
tree will be t. With this assumption, our primary interest will be in understanding 
how the upper tree can differ from t in the sorts of situations more likely to be 



encountered in phylogenetics. Although Theorem 3.2 provides an interesting the- 
oretical result in this vein, the required lifting is quite unlikely to appear in data. 
By reformulating the problem below directly in terms of the branch lengths of the 
tree metric, we are able to obtain more precise and relevant information about the 
action of rogue taxa. 
4.1. Preliminaries.

Notation 4.1. Given a positive integer n, we define $\mathcal{D}_n$ to be the cone of dissimilarity
maps on n taxa. We identify $\mathcal{D}_n$ with $\mathbb{R}_{+}^{\binom{n}{2}}$. Similarly, we define $\mathcal{T}_n \subset \mathcal{D}_n$
to be the space of tree metrics on n taxa. We omit the subscript n whenever it is
clear from the context. Finally, given a tree topology t, we denote by $\mathcal{T}_t \subset \mathcal{T}_n$ the
set of tree metrics with underlying tree topology t.

Notation 4.2. Given a trivalent tree t, the BME cone $\mathcal{N}_{\omega^{t}}$ associated to t will be
denoted by $\mathcal{C}_t$. Moreover, we call $\mathcal{C}_t^{+} = \mathcal{C}_t \cap \mathcal{D}_n$ the positive BME cone of t, also
known as the BME cone of dissimilarity maps associated to t.

Notation 4.3. In what follows, we write $\mathcal{P}_n$ for the BME polytope on n taxa. If
the number of taxa is understood, we omit the subscript.

Given a tree topology t on n taxa, let $\pi_t : \mathbb{R}_{\geq 0}^{\binom{n}{2}} \to \mathbb{R}^{2n-3}$ denote a map generalizing the branch length map for tree metrics as follows. The coordinates of this map are indexed by the branches of the tree t, and each coordinate is a linear function on the metric cone whose value on tree metrics with topology t is precisely the length of the corresponding edge. Note that this linear function is not unique, and it is positive on tree metrics with topology t. An expression defining the coordinate e of the map $\pi_t$ (that is, the branch length of e) can be obtained from the four-point condition equations (Pachter and Sturmfels 2005, Theorem 2.36) characterizing the tree topology t. For example, let t = ((1, 2), (3, 4)), let $e_i$ be the edge adjacent to leaf i, let e be the internal edge, and let $b_{e_i}$, $b_e$ be their corresponding lengths. Then $\pi_t(D) := (b_{e_1}(D), b_{e_2}(D), b_{e_3}(D), b_{e_4}(D), b_e(D))$, where

$$b_{e_1}(D) = (D_{31} - D_{32} + D_{12})/2, \qquad b_{e_2}(D) = (D_{32} - D_{31} + D_{12})/2,$$
$$b_{e_3}(D) = (D_{23} - D_{24} + D_{34})/2, \qquad b_{e_4}(D) = (D_{24} - D_{23} + D_{34})/2,$$
$$b_e(D) = (D_{13} + D_{24} - D_{12} - D_{34})/2.$$

The map $\pi_t$ has the property that it identifies the cone of tree metrics realizing t with $\mathbb{R}_{\geq 0}^{2n-3}$.
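The quartet formulas above can be checked numerically. The following is a minimal sketch, not part of the original computations: the function names and the dictionary encoding of D (keys are unordered pairs of taxa) are assumptions made for illustration. It builds the tree metric of ((1, 2), (3, 4)) from chosen branch lengths and recovers them with the branch length map.

```python
def quartet_metric(b1, b2, b3, b4, b0):
    """Tree metric of the quartet ((1,2),(3,4)); b0 is the internal branch length."""
    pendant = {1: b1, 2: b2, 3: b3, 4: b4}
    D = {}
    for i in range(1, 5):
        for j in range(i + 1, 5):
            # Paths inside a cherry avoid the internal edge; all others cross it.
            same_cherry = {i, j} in ({1, 2}, {3, 4})
            D[frozenset((i, j))] = pendant[i] + pendant[j] + (0 if same_cherry else b0)
    return D


def quartet_branch_lengths(D):
    """Branch length map for t = ((1,2),(3,4)), returned as (b1, b2, b3, b4, b0)."""
    def d(i, j):
        return D[frozenset((i, j))]

    b1 = (d(3, 1) - d(3, 2) + d(1, 2)) / 2
    b2 = (d(3, 2) - d(3, 1) + d(1, 2)) / 2
    b3 = (d(2, 3) - d(2, 4) + d(3, 4)) / 2
    b4 = (d(2, 4) - d(2, 3) + d(3, 4)) / 2
    b0 = (d(1, 3) + d(2, 4) - d(1, 2) - d(3, 4)) / 2
    return b1, b2, b3, b4, b0
```

On a tree metric with topology t the map returns the branch lengths that generated it; on other dissimilarity maps the same linear expressions are still defined but need not be non-negative.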

Our goal for this subsection is to understand the interplay between the branch lengths of a tree metric D with topology t and the possible upper trees one can obtain by lifting this metric. In particular, we wish to characterize the branch lengths of lower trees admitting a prescribed upper tree s. It is clear that if we start from a tree metric D with topology t and its corresponding branch length vector $\pi_t(D)$, we can easily lift D to a tree metric $\widetilde{D}$ whose underlying tree s contains t as a subtree. Hence, the union of the sets $\{\pi_t(D) : D \text{ admits a lifting } \widetilde{D} \text{ with BME tree } s\}$, as s varies among all possible upper BME trees, equals the set $\mathbb{R}_{\geq 0}^{2n-3}$. We want to understand each one of these sets. In particular, we want to answer the following challenge:

Problem 4.4. Given a tree topology t on n taxa and a trivalent tree topology s on n + 1 taxa, describe the cone of dissimilarity maps on n + 1 taxa whose BME tree equals s and whose restriction to the first n taxa is a tree metric of combinatorial type t.



For each upper tree s, the elements of the corresponding set in Problem 4.4 can be thought of as vectors in $\mathbb{R}^{3n-3}$, where the first $2n - 3$ entries encode the branch lengths of the lower tree t and the remaining ones refer to distances to the new taxon. That is,

(3)  $X_s(t) := \{(\pi_t(D|_{[n]}), D_{1,n+1}, \ldots, D_{n,n+1}) : D \in \mathcal{C}_s^{+},\ D|_{[n]} \text{ a tree metric with topology } t\},$

where $\mathcal{C}_s^{+}$ is the positive BME cone of s. By construction, these sets are polyhedral cones and they partition the set $\mathbb{R}_{\geq 0}^{3n-3}$:
Proposition 4.5. $X_s(t)$ is a rational (possibly empty) polyhedral cone for every s and t. It is described by two types of homogeneous linear constraints:

• all entries $D_{ij} \geq 0$ and $\pi_t(D) \geq 0$.

• inequalities describing the BME cone of s: they correspond to the directions $\omega^s - \omega^u$ for all trivalent trees u on n + 1 taxa, and all constants are zero. That is: $\langle \omega^s - \omega^u, D \rangle \leq 0$ for all trivalent trees u, so that s has minimal BME length.

Proof. $X_s(t)$ is a polyhedral cone because it is the image of a polyhedral cone under the linear map $D \mapsto (\pi_t(D|_{[n]}), D_{1,n+1}, \ldots, D_{n,n+1})$, where $D \in \mathcal{C}_s^{+} \cap (\mathcal{T}_t \times \mathbb{R}^n)$, that is, D is a lifting with BME tree s of a tree metric with topology t. The inequalities describing $X_s(t)$ follow by construction. The entries of $D|_{[n]}$ are expressed as linear combinations of the entries of $\pi_t(D|_{[n]})$. The second group of inequalities includes the facet inequalities of the BME cone of s, whose directions are given by the edges of the BME polytope containing the vertex $\omega^s$. To simplify the construction, we add the inequalities coming from differences between $\omega^s$ and all other vertices of V, and not only from the vertices $\omega^u$ adjacent to $\omega^s$. Adding these inequalities does no harm, and it simplifies the problem by avoiding the computation of the edges adjacent to $\omega^s$, which can be hard if the number of taxa is large. □

4.2. The reduced BME polytope. We now present an equivalent approach to 
our lifting task in the setting of this section, i.e. when D is a tree metric on n taxa 
with (trivalent) tree t and branch lengths b e . As shown below, all that is needed 
to study the restricted BME problem is a change of order of summation followed 
by a grouping of appropriate terms. This small modification reduces the problem 
from having a quadratic number of free variables to a linear number, as well as 
simplifying the constraints. After introducing the reduced polytope, we show that 
it has dimension $2n - 4$ by characterizing its affine hull.

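The set of edges of t will be denoted by E(t). Pick any lifting $\widetilde{D}$ of D, and any tree s with n + 1 leaves. The BME length of s with respect to $\widetilde{D}$ can be calculated as follows:

$$\lambda(s, \widetilde{D}) = \langle \omega^s, \widetilde{D} \rangle = \sum_{1 \leq i < j \leq n} \omega^s_{ij} D_{ij} + \sum_{i=1}^{n} \omega^s_{i,n+1} \widetilde{D}_{i,n+1}.$$

Now we simply substitute in the definition of the dissimilarity map D:

$$D_{ij} = \sum_{e \in (i \leftrightarrow j)} b_e,$$

where $e \in (i \leftrightarrow j)$ indicates that edge $e \in E(t)$ lies in the path between leaves i and j in tree t. Exchanging order of summation and regrouping,

(4)  $$\lambda(s, \widetilde{D}) = \sum_{e \in E(t)} b_e \Big( \sum_{\substack{i < j,\ i,j \neq n+1 \\ e \in (i \leftrightarrow j)}} \omega^s_{ij} \Big) + \sum_{i=1}^{n} \omega^s_{i,n+1} \widetilde{D}_{i,n+1},$$

which is again a simple inner product with a rational vector. For a tree s on n + 1 taxa, define $v^s \in \mathbb{R}^{3n-3}$ by

(5)  $$(v^s)_e = \sum_{\substack{i < j \leq n \\ e \in (i \leftrightarrow j)}} \omega^s_{ij} \ \text{ if } e \text{ is an edge of the lower tree}, \qquad (v^s)_{i,n+1} = \omega^s_{i,n+1} \ \text{ for } 1 \leq i \leq n.$$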
Note that this definition depends on the fixed tree t, but we do not incorporate it into the notation, as we will typically be fixing a lower tree.
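The regrouping that produces the reduced weight can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the code used for the paper: trees are encoded as nested tuples, a top-level pair stands for an unrooted tree whose degree-two root is suppressed, the new taxon carries the largest label, and the BME weight is taken to be $\omega_{ij} = 2^{1 - p_{ij}}$ with $p_{ij}$ the number of edges between leaves i and j. The function names are hypothetical. Edges of the lower tree are identified by the set of leaves on the side not containing the smallest leaf.

```python
from fractions import Fraction
from itertools import combinations


def leaf_paths(tree, prefix=()):
    """Map each leaf of a nested-tuple tree to its root-to-leaf path of child indices."""
    if isinstance(tree, int):
        return {tree: prefix}
    paths = {}
    for k, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (k,)))
    return paths


def bme_weights(tree):
    """BME weights w[(i, j)] = 2^(1 - p_ij), with p_ij the number of edges from i to j."""
    paths = leaf_paths(tree)
    weights = {}
    for i, j in combinations(sorted(paths), 2):
        pi, pj = paths[i], paths[j]
        k = 0  # depth of the last common ancestor of i and j
        while k < min(len(pi), len(pj)) and pi[k] == pj[k]:
            k += 1
        edges = len(pi) + len(pj) - 2 * k
        if k == 0 and len(tree) == 2:
            edges -= 1  # a degree-two root is not a node of the unrooted tree
        weights[(i, j)] = Fraction(2) ** (1 - edges)
    return weights


def edge_splits(tree):
    """Edges of the unrooted tree, each encoded by the leaf set avoiding the smallest leaf."""
    everything = frozenset(leaf_paths(tree))
    anchor = min(everything)
    found = set()

    def visit(node, is_root):
        if isinstance(node, int):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child, False) for child in node))
        if not is_root:
            found.add(below if anchor not in below else everything - below)
        return below

    visit(tree, True)
    return found


def reduced_weight(lower, upper):
    """Reduced weight of `upper` relative to `lower`, following the definition in (5)."""
    w = bme_weights(upper)
    lower_leaves = sorted(leaf_paths(lower))
    new_leaf = max(leaf_paths(upper))  # assumption: the new taxon has the largest label
    v = {}
    for side in edge_splits(lower):
        # An edge lies on the path from i to j exactly when it separates i from j.
        v[side] = sum(w[(i, j)] for i, j in combinations(lower_leaves, 2)
                      if (i in side) != (j in side))
    for i in lower_leaves:
        v[("x", i)] = w[(i, new_leaf)]
    return v
```

For the lower tree ((1, 2), (3, 4)) and the upper tree ((1, 2), (3, (4, 5))), the entry keyed by frozenset({3, 4}) is the coefficient of the internal branch length and the entries keyed by ("x", i) are the coefficients of the distances to the new taxon; the values agree with the first row of Table 1.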

To find the BME tree for a tree metric $(t, \{b_e\}_{e \in E(t)})$, we build a vector $v^s \in \mathbb{R}^{3n-3}$ for each trivalent tree s on n + 1 taxa. Each vector has entries indexed by the $2n - 3$ edges of t and the n distances $\{D_{i,n+1} : i = 1, \ldots, n\}$. Our goal is to find s minimizing the quantity (4). As in the case of the BME problem, we build a polytope $B^t$ (here in $(3n - 3)$-space) which is the convex hull of the points $v^s$ and study its properties.

Definition 4.6. Fix a tree t on n taxa and consider the points $v^s$ as in (5). The convex hull of these points is called the "reduced BME polytope", and we denote it by $B^t$. It only depends on the combinatorial type of the tree t and it is symmetric with respect to the group of symmetries of the tree t. The points $\{v^s : s \text{ a trivalent tree on } n + 1 \text{ taxa}\}$ are called "reduced weights." The inner normal fan of $B^t$ is called the "reduced fan." Cones in this fan are called "reduced cones" and their intersections with the positive orthant are called "positive reduced cones."

From the previous construction it is clear that the BME polytope and the reduced BME polytope are closely related. We now explain this connection. The linear map $\alpha_t : \mathbb{R}^{\binom{n+1}{2}} \to \mathbb{R}^{3n-3}$ assigning the reduced weight $v^s$ to the BME weight $\omega^s$ sends the polytope V surjectively onto the polytope $B^t$. That is, the reduced polytope is a linear projection of the BME polytope. On the dual side, the dual of the linear map $\alpha_t$ will inject the dual space of the polytope $B^t$ into the dual space of the polytope V, and in this case the linear spaces of both polytopes are identified by the map $\alpha_t$ (Proposition 4.9). We refer the interested reader to (Section 7.2, Ziegler 2006) for more information about projections of polytopes.

Example 4.7. We illustrate the previous construction in the case of liftings of the quartet tree t = ((1, 2), (3, 4)), describing the reduced weights $v^s$ for six trivalent trees s in Table 1. The remaining reduced weights can be obtained by relabelings of s that respect the combinatorial type of t. The table is organized as follows. The first five columns encode the branch lengths of the lower tree: $b_0$ for the internal edge of t, and $b_i$ for the edge pendant to taxon i. The rest, $x_1$ through $x_4$, are the four distances to the new taxon. The polytope $B^{((1,2),(3,4))} \subset \mathbb{R}^9$ is four-dimensional, has 14 vertices and f-vector (14, 46, 52, 20). The vertices of $V_5$ corresponding to the trees ((1, 3), (5, (2, 4))) and ((1, 4), (5, (2, 3))) project to the same vertex of $B^t$. Among all 14 vertices, only 5 correspond to upper BME trees: the reduced weight corresponding to the tree s = ((2, 5), (3, (1, 4))) and its five relabelings that fix t. The affine hull of $B^t$ has five defining linear equations: $x_1 + x_2 + x_3 + x_4 = 1$ and $b_i + x_i = 1$ for i = 1, 2, 3, 4. Analogous equations will define the affine hull for all reduced BME polytopes, as we show in Proposition 4.9. ◊



One can compute the dimension, number of vertices, and f-vector of the reduced polytope $B^t$ as we did in the case of the BME polytope. We can also study the behavior of the vertices of the BME polytope under the projection map, and see how many of its vertices collapse to a single vertex in $B^t$, how many lie in the interior, and how many lie in proper faces of positive dimension. We now show that the reduced polytope has dimension $2n - 4$ by characterizing its affine hull. First we state a technical lemma. Questions involving vertices and their behavior under the projection map will be deferred to the next section.
upper tree            b_1   b_2   b_3   b_4   b_0   x_1   x_2   x_3   x_4
((1,2),(3,(4,5)))     7/8   7/8   6/8   4/8   6/8   1/8   1/8   2/8   4/8
((1,2),(5,(3,4)))     6/8   6/8   6/8   6/8   4/8   2/8   2/8   2/8   2/8
((1,3),(2,(4,5)))     7/8   6/8   7/8   4/8   9/8   1/8   2/8   1/8   4/8
((1,3),(5,(2,4)))     6/8   6/8   6/8   6/8   10/8  2/8   2/8   2/8   2/8
((1,3),(4,(2,5)))     7/8   4/8   7/8   6/8   9/8   1/8   4/8   1/8   2/8
((1,5),(2,(3,4)))     4/8   6/8   7/8   7/8   6/8   4/8   2/8   1/8   1/8

Table 1. Reduced weights for trivalent trees on five taxa, starting
from the lower tree t = ((1, 2), (3, 4)), up to symmetry of the lower
tree t. The column labels show the quantity for which the entry
is the corresponding coefficient in the reduced weight vector: e.g.
the first entry of the table shows that 7/8 is the coefficient of $b_1$
for topology ((1,2), (3, (4, 5))).



Lemma 4.8. Given a tree t on n taxa, let $\omega$ denote the BME weight for t. Then

$$\sum_{j \neq i} \omega_{ij} = 1 \qquad \forall\, 1 \leq i \leq n.$$

Proof. If a non-backtracking random walk starts at i, then $\omega_{ij}$ is the probability of that walk ending at j. □
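As a small worked instance of the lemma, consider the quartet t = ((1, 2), (3, 4)). The computation below assumes the usual expression of the BME weight, $\omega_{ij} = 2^{1 - p_{ij}}$ with $p_{ij}$ the number of edges on the path from i to j; the symbol $p_{ij}$ is introduced here only for illustration.

```latex
\begin{align*}
\omega_{12} &= 2^{1-2} = \tfrac{1}{2}, &
\omega_{13} &= \omega_{14} = 2^{1-3} = \tfrac{1}{4}, \\
\sum_{j \neq 1} \omega_{1j} &= \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{4} = 1.
\end{align*}
```

In the random walk reading, a walk leaving leaf 1 reaches the first internal node and moves to leaf 2 with probability 1/2; otherwise it crosses the internal edge and ends at leaf 3 or leaf 4 with probability 1/4 each.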



Proposition 4.9. The affine hull of $B^t$ is characterized by n + 1 linearly independent linear equations. More precisely, they are given by $Ax = \mathbf{1} \in \mathbb{R}^{n+1}$, where

$$A = \begin{pmatrix} I_n & 0 & I_n \\ 0 & 0 & 1 \cdots 1 \end{pmatrix} \in \mathbb{R}^{(n+1) \times (3n-3)}$$

and the columns of A and points in $\mathbb{R}^{3n-3}$ are labeled by partitioning the coordinates as $(b_{e_1}, \ldots, b_{e_n} \mid b_e : e \text{ interior edge of } t \mid D_{1,n+1}, \ldots, D_{n,n+1})$. Here, $e_i$ denotes the edge pendant to the leaf i in tree t. In particular, $\dim B^t = 2n - 4$, and the (n + 1)-dimensional lineality space of the reduced fan coincides with the row span of A.

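Proof. First, we rewrite the equations in terms of the coordinates of reduced weights, then apply Lemma 4.8. Fix an upper tree s and write v and $\omega$ for $v^s$ and $\omega^s$ respectively. The following equalities hold:

$$v_{e_i} + v_{i,n+1} = \sum_{\substack{j=1 \\ j \neq i}}^{n} \omega_{ij} + \omega_{i,n+1} = \sum_{\substack{j=1 \\ j \neq i}}^{n+1} \omega_{ij} = 1 \qquad \forall\, 1 \leq i \leq n,$$

$$\sum_{i=1}^{n} v_{i,n+1} = \sum_{i=1}^{n} \omega_{i,n+1} = 1.$$

These are precisely the linear equations described by matrix A.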

We now prove that these equations characterize the space. To simplify notation, let $\psi$ be the surjective map $\psi(p) = (\pi_t(p|_{[n]}), p_{1,n+1}, \ldots, p_{n,n+1})$ for any lifting p of a tree metric with tree t. We proceed by dimensionality arguments. We know that rk(A) = n + 1, so $\dim B^t \leq 3n - 3 - (n + 1) = 2n - 4$. Our goal is to show that equality holds. It will suffice to show that the dimension of the lineality space of the "reduced fan" equals n + 1.

By construction, the shift vectors $\{h^a : 1 \leq a \leq n + 1\}$ represent tree metrics associated to a degeneration of the trivalent tree t with two nodes and one edge: a leaf labeled a and the other leaf labeled by the set $\{1, \ldots, \hat{a}, \ldots, n + 1\}$. Hence, these tree metrics can be expressed as points $\bar{h}^a = \psi(h^a)$ in $\mathbb{R}^{3n-3}$ and they generate an (n + 1)-dimensional vector space. These points are precisely the rows of A as described in the statement. Hence, it suffices to show that these vectors span the lineality space of the "reduced fan".

Fix any trivalent tree $s_0$ on n + 1 taxa. Given $p \in \mathbb{R}^{3n-3}$ in the lineality space of the reduced fan, by definition we have $\langle p, v^s \rangle = \langle p, v^{s_0} \rangle$ for all trees s. By construction, p lies in the image of $\psi$, so fix q with $p = \psi(q)$. Thus, $\langle q, \omega^s \rangle = \langle p, v^s \rangle$ for all s by (4), and so $\langle q, \omega^s \rangle = \langle q, \omega^{s_0} \rangle$ for all s. By definition, we have that q is in the lineality space of the BME fan and so it is a linear combination of the shift vectors. After applying the map $\psi$, the same holds for p and the vectors $\bar{h}^a$, and the result follows. □
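The equations of Proposition 4.9 can be checked directly on the reduced weights of Table 1. The sketch below is only a sanity check for n = 4; the function names are hypothetical, and the column order (pendant edges, interior edges, distances to the new taxon) follows the statement of the proposition, which for the quartet coincides with the column order of Table 1.

```python
from fractions import Fraction


def affine_hull_matrix(n):
    """Rows of A, with columns ordered as (n pendant | n - 3 interior | n distances)."""
    width = 3 * n - 3
    rows = []
    for i in range(n):
        row = [0] * width
        row[i] = 1               # pendant edge e_i
        row[2 * n - 3 + i] = 1   # distance from taxon i to the new taxon
        rows.append(row)
    rows.append([0] * (2 * n - 3) + [1] * n)  # sum of all distances to the new taxon
    return rows


# Rows of Table 1 in units of 1/8, ordered (b1, b2, b3, b4, b0, x1, x2, x3, x4).
TABLE_1 = {
    "((1,2),(3,(4,5)))": [7, 7, 6, 4, 6, 1, 1, 2, 4],
    "((1,2),(5,(3,4)))": [6, 6, 6, 6, 4, 2, 2, 2, 2],
    "((1,3),(2,(4,5)))": [7, 6, 7, 4, 9, 1, 2, 1, 4],
    "((1,3),(5,(2,4)))": [6, 6, 6, 6, 10, 2, 2, 2, 2],
    "((1,3),(4,(2,5)))": [7, 4, 7, 6, 9, 1, 4, 1, 2],
    "((1,5),(2,(3,4)))": [4, 6, 7, 7, 6, 4, 2, 1, 1],
}


def satisfies_affine_hull(eighths, n=4):
    """Check A x = 1 exactly for a reduced weight given in units of 1/8."""
    x = [Fraction(a, 8) for a in eighths]
    return all(sum(a * c for a, c in zip(row, x)) == 1
               for row in affine_hull_matrix(n))
```

Every row of Table 1 satisfies the n + 1 = 5 equations, which is consistent with the reduced polytope for the quartet having dimension 9 - 5 = 4.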

4.3. Analysis of the reduced BME polytope. In this section we focus on combinatorial properties of the reduced BME polytope and the behavior of the vertices of the BME polytope under the projection map $\alpha_t$, as t varies along the set of combinatorial types of trees on n taxa. In particular, we give a complete description of the vertices for up to six taxa (see Table 2). As we mentioned earlier, two tree topologies on n + 1 taxa can give the same vertex in the polytope $B^t$ and vertices of the BME polytope can map to interior points in $B^t$ under the projection map. As Example 4.7 shows, for four taxa there exists a pair of tree topologies with the same associated reduced weight, but all fourteen reduced weights are still vertices of $B^t$. Similarly, in the case of five taxa, a Polymake computation shows that all 94 possible (out of 105) reduced weights $\{v^s : s \text{ a trivalent tree on six taxa}\}$ are vertices. This is no longer true for six taxa.

By construction, the polytope $B^t$ encodes an optimization problem where we restrict our ambient space $\mathbb{R}^{\binom{n+1}{2}}$ to the space of extensions of tree metrics with associated tree t. In terms of the BME fan, this means cutting out the fan with the $(2n - 3)$-dimensional cone $\mathbb{R}_{+}\mathcal{T}_t \subset \mathbb{R}^{\binom{n}{2}}$ of tree metrics with topology t. Note that by intersecting the BME chambers with this cone, we may get a cone with dimension less than $2n - 3$. Moreover, it could very well happen that this intersection is just the lineality space $\mathbb{R}\langle \alpha_t(h^a) : 1 \leq a \leq n + 1 \rangle$ of the cone. This would imply that the point $v^s$ lies in the interior of the polytope. This is indeed what happens for six taxa, as we have found through computation:

Proposition 4.10. Let t = ((1, 2), (3, 4), (5, 6)) be the snowflake tree. Then the reduced polytope $B^t$ is generated by 792 reduced weights (out of the possible 945 reduced trivalent points) and it has 780 vertices and 83 227 facets. The remaining twelve reduced trivalent weights $v^s$ that are not vertices of $B^t$ lie in the interior of the polytope. They are associated to pairs of trivalent trees with topologies:

(1,((((2,3),(4,6)),7),5))    (1,((((2,4),(3,6)),7),5))
(1,((((2,3),7),(4,6)),5))    (1,((((2,3),7),(4,5)),6))
(1,(((2,3),((4,6),7)),5))    (1,(((2,5),((4,6),7)),3))
(1,((((2,5),(3,6)),7),4))    (1,((((2,6),(3,5)),7),4))
(1,((((2,5),7),(3,6)),4))    (1,((((2,5),7),(4,6)),3))
(1,(((2,5),((3,6),7)),4))    (1,(((2,4),((3,6),7)),5))
(1,((((2,6),7),(3,5)),4))    (1,((((2,6),7),(4,5)),3))
(1,(((2,6),((3,5),7)),4))    (1,(((2,4),((3,5),7)),6))
(1,((((2,3),(4,5)),7),6))    (1,((((2,4),(3,5)),7),6))
(1,(((2,3),((4,5),7)),6))    (1,(((2,6),((4,5),7)),3))
(1,((((2,4),7),(3,6)),5))    (1,((((2,4),7),(3,5)),6))
(1,((((2,5),(4,6)),7),3))    (1,((((2,6),(4,5)),7),3))

Similarly, if t is the lower tree (1, (((3, 4), 6), 5), 2) (the caterpillar tree), then the polytope $B^t$ has 804 distinct reduced weights, 800 vertices and 116 701 facets. In this case, all four reduced trivalent weights $v^s$ that are not vertices of $B^t$ lie in the interior. Each of these points corresponds to a single topology, and they are:

(1,((((2,(3,5)),7),4),6))
(1,((((2,6),3),7),(4,5)))
(1,((((2,(4,5)),7),3),6))
(1,((((2,6),4),7),(3,5)))

From the previous examples, we see that in the case of four and five taxa, all reduced points are vertices, and that for six taxa, reduced points are either vertices or interior points (Proposition 4.10). Thus, it is natural to ask if these are the only two possibilities:

Question. For $n \geq 7$ and any trivalent tree t on n taxa, are all reduced trivalent points either vertices or interior points of the reduced polytope $B^t$?

We expect the answer to be positive, provided the projection map $\alpha_t$ is generic.

We now switch gears and focus on the number of upper BME trees we can obtain from a lifting of a given tree metric with topology t. This study will highlight the behavior of "rogue taxa." Equivalently, we want to know how many positive reduced cones of $B^t$ (one for each trivalent tree s on n + 1 taxa) are non-empty. We provide a complete answer for up to six taxa in Table 2 below.

The next natural question is to determine the asymptotics of, or an upper bound for, the number of such non-empty positive reduced cones. As a first attempt, we give some insight about which topologies can be ruled out for upper BME trees. In other words, we ask which are the blocking topologies for upper trees.

Definition 4.11. Fix a trivalent tree t on n taxa and let $v^s$ be the reduced weight for a trivalent tree s on n + 1 taxa. We define a partial order on the set of reduced weights as follows: $v^s \succ v^{s'}$ if and only if $(v^s)_l \leq (v^{s'})_l$ for all $1 \leq l \leq 3n - 3$. We say s blocks s' if $v^s \succ v^{s'}$.

Lemma 4.12. Let t be a trivalent tree on n taxa, and let s, s' be trivalent trees on n + 1 taxa such that s blocks s'. Then s' cannot be a BME tree for any lifting $\widetilde{D}$ of a tree metric D with topology t.

Proof. It suffices to show that for any $\widetilde{D}$, $\lambda(s, \widetilde{D}) \leq \lambda(s', \widetilde{D})$, and this follows because $\widetilde{D}$ has non-negative entries. □
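The blocking relation is a coordinatewise comparison, so it is easy to test on the rows of Table 1. The following sketch is illustrative only; the function names are hypothetical and the two vectors are copied from Table 1 in units of 1/8, ordered as $(b_1, b_2, b_3, b_4, b_0, x_1, x_2, x_3, x_4)$.

```python
from fractions import Fraction


def blocks(v_s, v_s_prime):
    """True if v_s is at most v_s_prime in every coordinate, i.e. s blocks s'."""
    return all(a <= b for a, b in zip(v_s, v_s_prime))


def reduced_length(v, point):
    """Inner product of a reduced weight with (branch lengths, distances to the new taxon)."""
    return sum(a * x for a, x in zip(v, point))


# Reduced weights for the lower tree ((1,2),(3,4)), taken from Table 1.
V_CONTAINS_T = [Fraction(a, 8) for a in (6, 6, 6, 6, 4, 2, 2, 2, 2)]    # ((1,2),(5,(3,4)))
V_BLOCKED = [Fraction(a, 8) for a in (6, 6, 6, 6, 10, 2, 2, 2, 2)]      # ((1,3),(5,(2,4)))
```

Here the two vectors differ only in the coefficient of the internal branch length, so ((1, 2), (5, (3, 4))) blocks ((1, 3), (5, (2, 4))): for every non-negative choice of branch lengths and distances the first tree is at least as short as the second.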

We illustrate with examples on five taxa. 

Example 4.13. Let t = (1, ((3, 4), 5), 2). Out of all possible 94 vertices in $B^t$, there are 19 reduced vertices that are blocked by other vertices, out of 20 empty positive reduced cones. The blocking relation is described in Figure 4 and it gives 26 blocking upper tree topologies. We simplify the picture by reducing the relation modulo relabelings of all leaves involved in each chain that fix the lower tree t.

In particular we see that out of the 94 possible BME reduced vertices for t, we can rule out 19 of these vertices for upper trees by "blocking" relations. ◊
Figure 4. The blocking relations (up to symmetry) for trees on
six taxa. Pairs of trees in a column are a single blocking relation,
with the tree in the second row blocking the corresponding tree in
the first row. Note that these blocking relations do not come from
Theorem 3.5.



Unfortunately, this partial order is not a sufficient criterion to determine whether a tree on n + 1 taxa can be an upper tree or not. In particular, it cannot explain the obstruction to exchanging subtrees "over" the new pendant edge (Theorem 3.5), except in the case of quartet trees. However, understanding the blocking relation can give an upper bound for the asymptotics of the upper BME trees.

We end this section with a table describing the relation between the BME and reduced BME polytopes for up to six taxa. In the case of six taxa, we have two combinatorial types of lower trees and each one will label a row in our table. The row starting with "6a" indicates the caterpillar tree on six taxa, whereas "6b" refers to the snowflake tree (see Proposition 4.10).



          dim.         # vertices     # void upper    f-vector of reduced BME
 n     BME   red.     BME    red.     trees for t     positive cones
 3       2     2        3      3                      (0,0,3)
 4       5     4       15     14           2          (1,0,0,0,13)
 5       9     6      105     94          20          (16,1,6,0,0,0,71)
 6a     14     8      945    800         208          (160,32,98,10,39,0,0,461)
 6b     14     8      945    780         154          (123,0,144,9,39,0,0,0,465)



Table 2. A comparison between the BME and reduced BME 
polytopes for up to six taxa. In the case of six taxa, we have 
more than one combinatorial type for the lower tree t. Each vec- 
tor in the last column gives the number of reduced BME positive 
cones classified by dimension, starting from dimension n + 1 and 
up to dimension 3n — 3. The lowest dimensional ones correspond 
to reduced weights of forbidden upper BME trees, since they lie 
in the linear space spanned by the shift vectors. The discrepancy 
between the first entry of these vectors and the entry of the column 
indicating the number of voided upper trees reflects that several of 
these void trees have equal reduced weights. 



We conclude with an interesting and computationally challenging question:

Question. What are the asymptotics of the number of vertices of $B^t$ and of the number of upper BME trees and upper BME reduced trees for different combinatorial types of lower trees t?

4.4. The rogue taxon effect for four taxa. The extremal rays of each reduced 
cone can be interpreted to give precise information on the rogue taxon effect. In 
this section, we explore the reduced polyhedral cone associated to the lower tree 
((1, 2), (3, 4)) and the upper tree (((1, 5), 3), (2, 4)). Up to symmetry, this is the 
only lower/upper combination for this number of taxa such that the new taxon has
"rogue" behavior. By understanding the extremal rays of the polyhedral cone, we
establish Propositions 4.14 and 4.15.





        b_1   b_2   b_3   b_4   b_0   x_1   x_2   x_3   x_4
c_1      4     0     3     3     1     0     0     0     0
c_2      3     0     3     0     1     0     0     0     0
c_3      1     0     0     0     1     0     3     0     0
c_4      0     0     0     0     1     0     4     1     1
c_5      0     0     0     0     1     0     3     0     3
e_1      1     1     1     1     0     0     0     0     0
e_2      1     1     1     0     0     0     0     0     0
e_3      1     0     1     1     0     0     0     0     0
e_4      1     0     1     0     0     0     0     0     0
e_5      1     0     0     0     0     0     0     0     0
f_1      0     0     0     0     0     1     1     1     1
f_2      0     0     0     0     0     0     1     1     1
f_3      0     0     0     0     0     0     1     0     0
f_4      0     0     0     0     0     0     0     0     1
h_1      1     0     0     0     0     1     0     0     0
h_2      0     1     0     0     0     0     1     0     0
h_3      0     0     1     0     0     0     0     1     0
h_4      0     0     0     1     0     0     0     0     1

Table 3. The extremal rays of the polyhedral cone $X_s(t)$ for four
lower taxa, with t = ((1, 2), (3, 4)) and s = (((1, 5), 3), (2, 4)). The
rows represent the rays. Labeling conventions for rows and columns
are described in the text.



Table 3 gives the extremal rays of the cone $X_s(t)$. We follow the notation of Example 4.7 to label the columns. The rows label the extremal rays of the cone, and are divided into sections. The first section, labeled with c, contains the rays which give branch lengths and extra taxon distances with a nontrivial internal branch length for the lower tree. This is visible because of the 1 in the $b_0$ column. These rays are interesting as they represent the "minimal" rogue taxon examples. We analyze these $c_i$ in more detail below.

The second section, labeled with e, f, and h, shows how the pendant (leading to a leaf) branch lengths of the lower tree and the distances to the new taxon can be modified without changing the upper tree. That is, any positive multiple of these vectors can be added to a point in the cone while staying in the same polyhedral cone. For instance, $e_4$ says that we can increase the branch lengths $b_1$ and $b_3$ simultaneously while maintaining the same upper tree. The ray $f_3$, for example (which is all zero except for the $x_2$ column), says that we can increase the distance of the new taxon to the second original taxon without changing the upper tree. The $h_i$ are simply the shift vectors corresponding to the pendant branches. Thus $h_i$ means that we can increase the ith pendant branch length while increasing the distance of the new taxon to the ith original taxon without changing the upper tree.

These extremal rays can give some sufficient conditions for rogue taxon behavior. We specify branch lengths of quartets by a vector giving branch lengths in the order $(b_0, \ldots, b_4)$. We say that a vector x is a rogue vector for a branch length vector b if the BME tree for the combined data as in Table 3 is the tree (((1, 5), 3), (2, 4)). We will call the cone given by positive linear combinations of the set

{(0, 1, 1, 1, 1), (0, 1, 1, 1, 0), (0, 1, 0, 1, 1), (0, 1, 0, 1, 0), (0, 1, 0, 0, 0)}

the extension cone. Any element from this cone can be added to a branch length vector without changing the polyhedral cone; this can be seen by looking at the $e_i$ vectors above.

Note that any vector satisfying $0 \leq x_1 \leq x_3 \leq \min(x_2, x_4)$ sits in the cone generated by the $f_i$ restricted to their last four coordinates. Indeed, such a vector equals $x_1 f_1 + (x_3 - x_1) f_2 + (x_2 - x_3) f_3 + (x_4 - x_3) f_4$, and all four coefficients are non-negative. Therefore we conclude:

Proposition 4.14. Any vector satisfying $0 \leq x_1 \leq x_3 \leq \min(x_2, x_4)$ is a rogue vector for any tree with branch length vector given by either (1, 4, 0, 3, 3) or (1, 3, 0, 3, 0) plus any element of the extension cone.

The next proposition gives rogue criteria for a quartet tree with arbitrary internal
branch length. The proof is simple: just look at $c_5$ in Table 3, which shows that
(0, 3, 0, 3) is a rogue vector for the quartet with trivial pendant branch lengths and
internal branch length 1.

Proposition 4.15. Any quartet tree has a rogue vector with an entry greater than 
or equal to three times the internal branch length of the lower tree. 

Although the above propositions do give some conditions on when the rogue taxon effect appears for four taxa, they do not specify how likely we are to end up in a rogue taxon situation. They also give no information about trees on a larger number of taxa. In the next section, we gain some intuition about these questions via simulation.

4.5. Simulations. Here we describe simulations performed to better understand 
the rogue taxon effect as it might appear in biological data. These simulations show 
that, at least for small numbers of taxa, the rogue taxon effect is common when the 
extra distances are chosen without reference to the original tree. They also suggest 
that the effect gets worse as the number of taxa increases. 

We assume a random distribution for the branch lengths and distances to the 
new taxon. Such simulations are not the only way to address these sorts of ques- 
tions. Volume computations of, e.g., spheres intersected with our polyhedral cones 
are in principle possible, but they do not seem to admit a closed form solution. 
Thus our understanding of such volumes still depends strongly on Monte Carlo
simulations (Eickmeyer et al. 2008). Furthermore, such a volume may give less
practical information than simulation using a reasonable model of branch lengths.

To better understand the frequency with which the rogue taxon phenomenon can occur, we simulate using the exponential distribution. Although a simple arbitrary choice, the exponential distribution is realistic enough to be a branch length prior for Bayesian phylogenetic inference (Ronquist et al. 2005). For a given lower tree, we generate branch lengths for that tree according to the mean one exponential distribution, then generate distances to the extra taxon via the exponential distribution with mean equal to the expected pairwise distance between tips of the tree. Then, we find the upper tree (i.e. the BME tree for the original data set plus the rogue taxon) and check to see how many bipartitions of the upper tree (restricted to the lower taxa) are not contained in the lower tree. This number is the Robinson-Foulds distance between the upper and lower trees used earlier in the paper.

Table 4. Simulation results for $10^7$ exponentially distributed
branch lengths and distances to rogue taxa. The columns are labeled
by the topology of the lower tree. The numbers in the table
represent the fraction of time that the corresponding Robinson-Foulds
distance between the upper and lower trees appeared via
the rogue taxon effect.

The results of $10^7$ exponentially drawn branch lengths are shown in Table 4; they show that a taxon added with random data can substantially alter the structure of the phylogenetic tree. Indeed, almost 30% of the lifted four-taxon trees do not contain the original topology, growing to almost 50% for five taxa, then almost 62% for the six-taxon topologies.
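The four-taxon experiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind Table 4: the BME weight is taken to be $2^{1 - p_{ij}}$ with $p_{ij}$ the number of edges between leaves i and j, trees are nested tuples with a suppressed degree-two root, and "expected pairwise distance between tips" is read as the average over the six pairs of the quartet (two pairs at expected distance 2 and four at expected distance 3, giving 8/3). All function names are hypothetical.

```python
import random
from itertools import combinations

QUARTETS = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
MEAN_PAIRWISE = 8.0 / 3.0  # assumption: average expected tip-to-tip distance in a quartet


def five_taxon_trees():
    """All 15 trivalent trees on {1,...,5}, paired with their restriction to {1,2,3,4}."""
    trees = []
    for q in QUARTETS:
        (a, b), (c, d) = q
        trees.append((q, ((a, b), (5, (c, d)))))    # taxon 5 on the internal edge
        trees.append((q, (((a, 5), b), (c, d))))    # taxon 5 on a pendant edge
        trees.append((q, ((a, (b, 5)), (c, d))))
        trees.append((q, ((a, b), ((c, 5), d))))
        trees.append((q, ((a, b), (c, (d, 5)))))
    return trees


def leaf_paths(tree, prefix=()):
    if isinstance(tree, int):
        return {tree: prefix}
    paths = {}
    for k, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (k,)))
    return paths


def bme_weights(tree):
    """w[(i, j)] = 2^(1 - number of edges between leaves i and j), for i < j."""
    paths = leaf_paths(tree)
    weights = {}
    for i, j in combinations(sorted(paths), 2):
        pi, pj = paths[i], paths[j]
        k = 0
        while k < min(len(pi), len(pj)) and pi[k] == pj[k]:
            k += 1
        edges = len(pi) + len(pj) - 2 * k
        if k == 0 and len(tree) == 2:
            edges -= 1  # the degree-two root is suppressed
        weights[(i, j)] = 2.0 ** (1 - edges)
    return weights


CANDIDATES = [(q, bme_weights(s)) for q, s in five_taxon_trees()]


def upper_restriction(D):
    """Restriction to {1,2,3,4} of a five-taxon tree of minimal BME length for D."""
    best = min(CANDIDATES, key=lambda c: sum(c[1][pair] * D[pair] for pair in D))
    return best[0]


def rogue_fraction(trials, seed=0):
    """Fraction of random liftings of ((1,2),(3,4)) whose upper tree does not contain it."""
    rng = random.Random(seed)
    lower = QUARTETS[0]
    rogue = 0
    for _ in range(trials):
        b = [rng.expovariate(1.0) for _ in range(5)]  # b[0] is the internal branch
        D = {}
        for i, j in combinations(range(1, 5), 2):
            crosses = {i, j} not in ({1, 2}, {3, 4})
            D[(i, j)] = b[i] + b[j] + (b[0] if crosses else 0.0)
        for i in range(1, 5):
            D[(i, 5)] = rng.expovariate(1.0 / MEAN_PAIRWISE)
        rogue += upper_restriction(D) != lower
    return rogue / trials
```

Calling rogue_fraction with a large number of trials gives a Monte Carlo estimate of the four-taxon rogue frequency under these assumptions; for four taxa the Robinson-Foulds count described above is either 0 or 1, so the fraction returned is the frequency of a nonzero distance.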

We emphasize that such simulations do not paint an accurate picture of the rogue 
taxon effect for real data. Indeed, even the worst data does not have completely 
random distances: even "random" sequence data will not have random distances 
to the rest of the tree. Nevertheless, we believe that these results indicate that 
this area merits further investigation and that the effective volume of these "rogue" 
polyhedral cones is not small. 

In the reduced BME setting it can happen that multiple bifurcating upper trees 
are associated with a cone of the reduced normal fan for a given lower tree. That 
is, the trees all have the same BME length for given lower tree branch lengths and 
rogue taxon distances. We have observed in the example presented here that when 
there are these multiple trees, the Robinson-Foulds distance between the lower tree 
and these multiple upper trees (restricted to the lower taxa) for a given cone are 
equal. It would be interesting to know if this is true in the general case. 

The equivalent fact for the quartet distance is not true. In the case of the lower 
tree being (((1, 2), 5), ((3, 4), 6)), there is a cone of the reduced normal fan associ- 
ated with both (1, ((((2, 3), (4, 7)), 6), 5)) and (1, ((((2, 6), (4, 7)), 3), 5)). Restricting 
to the lower taxa, these trees are (1, ((((2, 3), 4), 6), 5)) and (1, ((((2, 6), 4), 3), 5)), 
which have quartet distances 10 and 11, respectively, to the lower tree. 
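The split count used for the Robinson-Foulds comparison can be sketched for the example above. The code is illustrative: trees are nested tuples with a suppressed degree-two root, the function names are hypothetical, and the restricted upper trees are taken from the text rather than computed by pruning taxon 7.

```python
def leaf_set(node):
    """Set of leaves below a node of a nested-tuple tree."""
    if isinstance(node, int):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def nontrivial_splits(tree):
    """Nontrivial splits of the unrooted tree, encoded by the side avoiding the smallest leaf."""
    everything = leaf_set(tree)
    anchor = min(everything)
    found = set()

    def visit(node):
        if isinstance(node, int):
            return
        for child in node:
            side = leaf_set(child)
            if anchor in side:
                side = everything - side
            if 2 <= len(side) <= len(everything) - 2:
                found.add(side)
            visit(child)

    visit(tree)
    return found


def splits_not_in(upper_restricted, lower):
    """Number of nontrivial splits of the restricted upper tree absent from the lower tree."""
    return len(nontrivial_splits(upper_restricted) - nontrivial_splits(lower))


LOWER = (((1, 2), 5), ((3, 4), 6))
UPPER_A = (1, ((((2, 3), 4), 6), 5))
UPPER_B = (1, ((((2, 6), 4), 3), 5))
```

Both restricted upper trees have three nontrivial splits, none of which is a split of the lower tree, so splits_not_in returns the same value for both, in agreement with the observation that the Robinson-Foulds distances coincide even though the quartet distances differ.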
5. Conclusions and future directions

We have investigated the effect of adding an extra "rogue" taxon into a phylo- 
genetic data set for BME phylogenetic inference. We have shown that rogue taxa 
can have significant though not arbitrary effects on the tree. For a small number 
of taxa, we can delineate the domain of the rogue taxon effect. Simulations show 
that the rogue taxon effect is very significant when the data for the rogue taxon is 
chosen randomly without reference to the topology of the original tree. 

The results presented here may have algorithmic consequences for phylogenetic 
inference. It is common for inference programs to start with a tree on three taxa 
then build a tree by adding taxa sequentially. Software packages using sequential taxon addition, such as PHYLIP (Felsenstein 1995) and fastDNAml (Olsen et al. 1994), do optimize the tree after addition using rearrangements; the question of strict sequential addition performance is still important in order to determine the amount of post-addition optimization required. Furthermore, "evolutionary placement algorithms" for large amounts of sequence data have been proposed whereby "query" sequences are inserted into a fixed "reference tree" (Von Mering et al. 2007; Berger and Stamatakis 2009). The accuracy of such algorithms compared to traditional phylogenetics algorithms can be seen as an aspect of the rogue taxon problem.

An interesting next direction would be to consider situations where rogue taxa 
do not have arbitrary data, but appear via misspecified evolutionary models. This 
will hopefully give a clearer understanding of the actual impact of rogue taxa. It 
would also be interesting to see if some of the results presented here also extend to 
other inference criteria, such as parsimony or maximum likelihood. Some results, 
such as the simulation results presented above, will certainly be different in this 
new setting but others may correspond well. Maximum likelihood and parsimony 
are considerably more difficult to analyze, but hopefully the results presented here 
can act as a guide. 